516 research outputs found

    Bootstrapping named entity resources for adaptive question answering systems

    Question Answering (QA) systems extend traditional search engines with the ability to find precise answers to user questions. Their aim is to make information access easier by reducing the time and effort that the user needs to find a concrete piece of information in a list of relevant documents. This thesis presents two lines of work related to QA systems. The first part introduces an architecture for QA systems for Spanish based on the combination and adaptation of different techniques from Information Retrieval (IR) and Information Extraction (IE). The architecture is composed of three main modules: question analysis, relevant passage retrieval, and answer extraction and selection.
The appropriate processing of Named Entities (NE) receives special attention because they are frequently the topic of a question or good candidate answers. The proposed architecture has been implemented in the MIRACLE QA system, which has been evaluated independently over several editions of the CLEF@QA track of the Cross-Language Evaluation Forum (CLEF). The participations and results of the 2004 to 2007 campaigns, together with the evolution of the system, are described in detail. The MIRACLE QA system has obtained moderate performance, with a first-answer accuracy between 20% and 30%. The results obtained in the 2005 main QA task and in the 2006 RealTimeQA pilot task stand out; the latter included response time as an additional evaluation factor besides answer correctness. These results support the proposed architecture as a viable option for QA over textual collections and confirm similar findings reported for English and other languages. On the other hand, the analysis of the results across evaluation campaigns and the comparison with other QA systems point to problems with current systems and to new challenges. In our experience, QA systems are more difficult to adapt to new domains and languages than IR systems. The problem is inherited from the use of complex language analysis tools such as POS taggers, parsers and other semantic analyzers, including tools for NE Recognition and Classification (NERC) and for Relation Detection and Characterization (RDC). The second part of the thesis tackles this problem and proposes a different approach, based on acquiring knowledge for the semantic analyzers with lightly supervised learning methods. The goal is to obtain resources that are useful for NERC and RDC from unannotated text collections, using as few annotated resources as possible. In addition, we try to avoid dependencies on other language analysis tools so that the techniques can be ported to different languages and domains. First, we studied previous work on building NERC and RDC modules with little supervision, in particular bootstrapping methods that start from a few examples. We propose a common framework for different bootstrapping systems that helps to unify the functions used to evaluate and select intermediate results, both instances and patterns. The main contribution is a new algorithm that acquires, simultaneously and iteratively, the instances and patterns associated with a relation of interest. It can also acquire several relations at the same time and uses mutual exclusion among relations to reduce concept drift and obtain better results. A distinctive characteristic is its query-based exploration strategy over an indexed text collection, which makes it possible to acquire knowledge from large collections. Candidate selection and evaluation are based on incrementally building a graph of instances and patterns, which also justifies our evaluation function. The discovery procedure is analogous to the exploration frontier of a web crawler and finds the instances most similar to the seeds given the available evidence. The algorithm has been implemented in the SPINDEL system. For its evaluation we selected the task of acquiring resources for the most common NE classes: Person, Location and Organization. The objective is to acquire name instances belonging to each class, as well as contextual patterns that help to detect mentions of that class.
We present results for the acquisition of resources from raw text in two languages, Spanish and English, and, for Spanish, in two different collections: news and texts from a collaborative encyclopedia, Wikipedia. In both cases the use of language analysis tools was deliberately limited, in line with the goal of moving towards language independence. Starting from fewer than 40 seeds per class, the bootstrapping process acquires name lists of up to about 30,000 instances of variable quality, together with lists of indicative patterns for each entity class. An indirect evaluation confirms the utility of both resources for NE classification using a simple dictionary-based approach: the best configuration obtains an F-measure of 67.17 for Spanish and 55.99 for English, and the acquired patterns help to improve recall in both cases. The module requires much less development effort than supervised approaches once the cost of annotation is taken into account, although its performance is not yet on par with them. Overall, this research is a first step towards semantic applications, such as QA systems, that require less adaptation effort for a new domain or language.
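
As a rough illustration of the bootstrapping idea summarised above, the sketch below alternates between acquiring contextual patterns from the current set of names and acquiring new names that fill the best patterns. It is only a minimal sketch under simplifying assumptions: it is not the SPINDEL algorithm (no index-based exploration, no instance/pattern graph, no mutual exclusion among classes), and the windowing and frequency-based scoring are placeholders chosen for brevity.

```python
# Minimal bootstrapping loop: seeds -> patterns -> new instances -> ...
# Simplified illustration only; not the SPINDEL implementation.
import re
from collections import Counter

def contexts(tokens, name, window=2):
    """Yield (left, right) token windows around each occurrence of `name`."""
    for i, tok in enumerate(tokens):
        if tok == name:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            if left and right:
                yield left, right

def bootstrap(corpus, seeds, iterations=3, top_patterns=20, top_instances=50):
    instances = set(seeds)
    best_patterns = []
    for _ in range(iterations):
        # 1) Pattern acquisition: contexts that co-occur with known instances.
        pattern_counts = Counter()
        for doc in corpus:
            tokens = doc.split()
            for name in instances:
                pattern_counts.update(contexts(tokens, name))
        best_patterns = [p for p, _ in pattern_counts.most_common(top_patterns)]
        # 2) Instance acquisition: tokens that fill the best patterns.
        candidate_counts = Counter()
        for doc in corpus:
            for left, right in best_patterns:
                rx = re.escape(left) + r"\s+(\w+)\s+" + re.escape(right)
                candidate_counts.update(m.group(1) for m in re.finditer(rx, doc))
        instances.update(n for n, _ in candidate_counts.most_common(top_instances))
    return instances, best_patterns
```

Each iteration widens the seed set, so the quality of the scoring function is what keeps concept drift in check; in the thesis that role is played by the graph-based evaluation of instances and patterns.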

    Evaluation of Named Entity Recognition in Spanish with OpenCalais

    The Semantic Web ecosystem has seen a growing popularity of commercial Information Extraction services. Among them, OpenCalais provides Named Entity Recognition and Classification in Spanish and is easy to integrate into NLP applications. We have evaluated this entity annotation service on the CoNLL 2002 news corpus. Its precision is good enough for developing applications that use the main classes (person, location and organization). However, recall and the treatment of ambiguous entities could be improved to be on par with Spanish research prototypes.

This work has been partially funded by the MA2VICMR network (S2009/TIC-1542) and by the BRAVO project (TIN2007-67407-C03-01).
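
For context, per-class precision, recall and F1 over the CoNLL 2002 gold annotations can be computed along the lines below; the sketch assumes entities are already aligned as (sentence_id, start, end, class) tuples and is not tied to the OpenCalais output format.

```python
# Per-class P/R/F1 over sets of gold and predicted entity spans.
def evaluate(gold, predicted, classes=("PER", "LOC", "ORG")):
    scores = {}
    for cls in classes:
        g = {e for e in gold if e[-1] == cls}
        p = {e for e in predicted if e[-1] == cls}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cls] = (precision, recall, f1)
    return scores
```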

    Anonimytext: anonymization of unstructured documents

    Proceedings of: The International Conference on Knowledge Discovery and Information Retrieval (KDIR 2009), part of the First International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2009), Funchal, Madeira, Portugal, October 2009.

The anonymization of unstructured texts is nowadays a task of great importance in several text mining applications. Medical record anonymization is needed both to preserve the privacy of personal health information and to enable further data mining. The ANONYMITEXT system described here is designed to de-identify sensitive data in unstructured documents. It has been applied to Spanish clinical notes to recognize sensitive concepts that would need to be removed if the notes are used beyond their original scope. The system combines several medical knowledge resources with semantic dictionaries induced from the clinical notes. The semi-automatic process has been evaluated on a subset of the clinical notes for the most frequent attributes.

This work has been partially supported by MAVIR (S-0505/TIC-0267) and by the BRAVO project (TIN2007-67407-C03-01).
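
A minimal sketch of dictionary-driven de-identification in the spirit of the description above: terms found in sensitive-concept dictionaries are replaced by category placeholders. The dictionaries, categories and example sentence are hypothetical, not the actual ANONYMITEXT resources.

```python
# Replace occurrences of sensitive terms with category placeholders.
import re

SENSITIVE = {
    "PERSON": {"Juan Pérez", "María García"},      # hypothetical dictionary
    "HOSPITAL": {"Hospital Universitario"},        # hypothetical dictionary
}

def deidentify(text, dictionaries=SENSITIVE):
    for category, terms in dictionaries.items():
        # Longer terms first, so multi-word names are not partially replaced.
        for term in sorted(terms, key=len, reverse=True):
            text = re.sub(re.escape(term), f"<{category}>", text)
    return text

print(deidentify("Juan Pérez ingresó en el Hospital Universitario."))
# -> "<PERSON> ingresó en el <HOSPITAL>."
```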

    Combining Syntactic Information and Domain-Specific Lexical Patterns to Extract Drug-Drug Interactions from Biomedical Texts

    Proceedings of: 19th ACM International Conference on Information and Knowledge Management (CIKM 2010), October 26-30, 2010, Toronto, Canada. Event web site: http://www.yorku.ca/cikm10/

A drug-drug interaction (DDI) occurs when one drug influences the level or activity of another drug. The increasing volume of the scientific literature overwhelms health care professionals trying to keep up to date with all published studies on DDI. Information Extraction (IE) techniques can provide an interesting way of reducing the time health care professionals spend reviewing the literature. Nevertheless, no previous approach had addressed the extraction of DDI from texts. To the best of our knowledge, this work proposes the first integral solution for the automatic extraction of DDI from biomedical texts.

This work has been partially supported by the Spanish research projects MA2VICMR (S2009/TIC-1542, www.mavir.net), a network of excellence funded by the Madrid Regional Government, and TIN2007-67407-C03-01 (BRAVO: Advanced Multimodal and Multilingual Question Answering).

    Combining Similarities with Regression based Classifiers for Entity Linking at TAC 2010

    Papers of: Text Analysis Conference 2010 Workshop (TAC 2010), November 15-16, 2010, National Institute of Standards and Technology, Gaithersburg, Maryland, USA. Event web site: http://www.nist.gov/tac/2010/workshop/index.html

The UC3M team submitted three runs for each Entity Linking task proposed in the Knowledge Base Population (KBP) track at TAC 2010. The skeleton system presented in 2009 has evolved in 2010 by incorporating new tools, new algorithms for candidate retrieval and feature extraction, and two new stages that use regression-based classifiers for candidate filtering. These improvements have allowed the overall results of the UC3M team to almost reach the median values of all participants in the Entity Linking tasks.

This work has been partially supported by the Regional Government of Madrid through the Research Network MA2VICMR (S2009/TIC-1542) and by the Spanish Ministry of Education through the BRAVO project (TIN2007-67407-C03-01).
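
The candidate-filtering stage described above can be pictured as a binary decision over (mention, candidate) pairs represented by similarity features. The sketch below uses logistic regression from scikit-learn as a stand-in for the regression-based classifiers; the feature set and toy data are assumptions, not the UC3M system configuration.

```python
# Regression-based filtering of entity-linking candidates (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature vectors for (mention, candidate) pairs:
# [name string similarity, context cosine similarity, candidate popularity]
X_train = np.array([
    [0.9, 0.7, 0.8],   # correct link
    [0.2, 0.1, 0.5],   # wrong candidate
    [0.8, 0.6, 0.1],   # correct link
    [0.3, 0.4, 0.9],   # wrong candidate
])
y_train = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X_train, y_train)

# Probability that a new candidate is the correct entity for its mention.
new_pair = np.array([[0.85, 0.5, 0.3]])
print(clf.predict_proba(new_pair)[:, 1])
```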

    Using a shallow linguistic kernel for drug-drug interaction extraction

    A drug–drug interaction (DDI) occurs when one drug influences the level or activity of another drug. Information Extraction (IE) techniques can provide health care professionals with an interesting way to reduce the time spent reviewing the literature for potential drug–drug interactions. Nevertheless, no approach had been proposed for the problem of extracting DDIs from biomedical texts. In this article, we study whether a machine learning-based method is appropriate for DDI extraction in biomedical texts and whether its results are superior to those obtained with our previously proposed pattern-based approach [1]. The method proposed here for DDI extraction is based on a supervised machine learning technique, more specifically, the shallow linguistic kernel proposed by Giuliano et al. (2006) [2]. Since no benchmark corpus was available to evaluate our approach to DDI extraction, we created the first such corpus, DrugDDI, annotated with 3169 DDIs. We performed several experiments varying the configuration parameters of the shallow linguistic kernel. The model that maximizes the F-measure was evaluated on the test data of the DrugDDI corpus, achieving a precision of 51.03%, a recall of 72.82% and an F-measure of 60.01%. To the best of our knowledge, this work proposes the first full solution for the automatic extraction of DDIs from biomedical texts. Our study confirms that the shallow linguistic kernel outperforms our previous pattern-based approach. Additionally, it is our hope that the DrugDDI corpus will allow researchers to explore new solutions to the DDI extraction problem.

This study was funded by the projects MA2VICMR (S2009/TIC-1542) and MULTIMEDICA (TIN2010-20644-C03-01).
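
To make the classification setting concrete, the sketch below represents each candidate drug pair by the words before, between and after the two mentions and trains an SVM on toy sentences. This approximates only the global-context part of the shallow linguistic kernel of Giuliano et al. (2006); the sentences and drug names are invented and this is not the configuration evaluated on the DrugDDI corpus.

```python
# Sentence-level DDI classification from before/between/after contexts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def pair_context(sentence, drug1, drug2):
    """Concatenate before/between/after text around a candidate drug pair."""
    s = sentence.lower()
    i, j = sorted((s.find(drug1.lower()), s.find(drug2.lower())))
    return f"B: {s[:i]} M: {s[i:j]} A: {s[j:]}"

# Toy training pairs: (sentence, drug1, drug2, interacts?)
train_pairs = [
    ("Aspirin increases the effect of warfarin.", "aspirin", "warfarin", 1),
    ("Aspirin and ibuprofen were administered separately.", "aspirin", "ibuprofen", 0),
]
X = [pair_context(s, d1, d2) for s, d1, d2, _ in train_pairs]
y = [label for *_, label in train_pairs]

model = make_pipeline(CountVectorizer(), SVC(kernel="linear")).fit(X, y)
print(model.predict([pair_context("Warfarin toxicity is increased by aspirin.",
                                  "warfarin", "aspirin")]))
```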

    Lightly supervised acquisition of named entities and linguistic patterns for multilingual text mining

    Named Entity Recognition and Classification (NERC) is an important component of applications like Opinion Tracking, Information Extraction, or Question Answering. When these applications need to work in several languages, NERC becomes a bottleneck because its development requires language-specific tools and resources like lists of names or annotated corpora. This paper presents a lightly supervised system that acquires lists of names and linguistic patterns from large raw text collections in western languages, starting from only a few seeds per class selected by a human expert. Experiments have been carried out with English and Spanish news collections and with the Spanish Wikipedia. Evaluation of NE classification on standard datasets shows that the acquired NE lists achieve high precision and that the contextual patterns increase recall significantly. The approach is therefore helpful for applications for which annotated NERC data are not available, such as those that have to deal with several western languages or with information from different domains.

This research work has been supported by the Regional Government of Madrid under the Research Network MA2VICMR (S2009/TIC-1542), by the Spanish Ministry of Education under the project MULTIMEDICA (TIN2010-20644-C03-01) and by the Spanish Center for Industry Technological Development (CDTI, Ministry of Industry, Tourism and Trade) through the BUSCAMEDIA project (CEN-20091026).
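
The dictionary-plus-patterns classification evaluated above can be sketched as a two-step lookup: first the acquired name lists (high precision), then the contextual patterns as a fallback (better recall). The lists, patterns and mentions below are tiny hypothetical stand-ins for the acquired resources.

```python
# Dictionary lookup with a contextual-pattern fallback for NE classification.
NAME_LISTS = {
    "PERSON": {"Cervantes", "Ada Lovelace"},
    "LOCATION": {"Madrid", "Toledo"},
    "ORGANIZATION": {"UNED", "UC3M"},
}
PATTERNS = {
    "PERSON": ["said", "was born in"],
    "LOCATION": ["lives in", "traveled to"],
    "ORGANIZATION": ["works for", "founded by"],
}

def classify(mention, left_context="", right_context=""):
    for cls, names in NAME_LISTS.items():
        if mention in names:
            return cls                      # high-precision dictionary lookup
    context = f"{left_context} {right_context}"
    for cls, pats in PATTERNS.items():
        if any(p in context for p in pats):
            return cls                      # pattern fallback improves recall
    return "UNKNOWN"

print(classify("Madrid"))                                   # from the list
print(classify("Ruritania", left_context="traveled to"))    # from a pattern
```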

    DAEDALUS at RepLab 2014: Detecting RepTrak reputation dimensions on tweets

    This paper describes our participation in the RepLab 2014 reputation dimensions scenario. Our idea was to evaluate the best strategy for combining a machine learning classifier with a rule-based algorithm based on logical expressions of terms. Results show that our baseline experiment, using just Naive Bayes Multinomial with a term vector model representation of the tweet text, is ranked second among the runs from all participants in terms of accuracy.
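
The baseline run mentioned above corresponds closely to a term-count representation fed to Multinomial Naive Bayes; a minimal scikit-learn version looks roughly like the sketch below, with invented tweets and RepTrak-style dimension labels in place of the RepLab data.

```python
# Term-count vectors + Multinomial Naive Bayes for reputation dimensions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented tweets labelled with RepTrak-style reputation dimensions.
tweets = [
    "Great customer service from the airline today",
    "The company announced record quarterly profits",
    "New eco-friendly packaging announced by the brand",
]
labels = ["Products & Services", "Performance", "Citizenship"]

model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(tweets, labels)
print(model.predict(["Profits grew again this quarter"]))
```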

    MIRACLE’s hybrid approach to bilingual and monolingual Information Retrieval

    The main goal of the bilingual and monolingual participation of the MIRACLE team at CLEF 2004 was to test the effect of combination approaches on information retrieval. The starting point is a set of basic components: stemming, transformation, filtering, generation of n-grams, weighting and relevance feedback. Some of these basic components are used in different combinations and orders of application for document indexing and for query processing. In addition, a second-order combination is performed, mainly by averaging or by selectively combining the documents retrieved by different approaches for a particular query.
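
The second-order combination can be illustrated as score fusion over the runs produced by different component combinations: scores are normalised per run and averaged per document. This is a generic sketch of averaging-based fusion, not the exact MIRACLE merging procedure; run names and scores are invented.

```python
# Average-based fusion of ranked retrieval runs for one query.
def normalise(run):
    """Min-max normalise one run's scores to [0, 1]."""
    lo, hi = min(run.values()), max(run.values())
    return {doc: (s - lo) / (hi - lo) if hi > lo else 1.0
            for doc, s in run.items()}

def average_fusion(runs):
    merged = {}
    for run in map(normalise, runs):
        for doc, score in run.items():
            merged[doc] = merged.get(doc, 0.0) + score / len(runs)
    return sorted(merged.items(), key=lambda x: x[1], reverse=True)

run_a = {"d1": 12.0, "d2": 7.5, "d3": 3.1}   # e.g. stemmed index
run_b = {"d2": 0.9, "d4": 0.6, "d1": 0.3}    # e.g. n-gram index
print(average_fusion([run_a, run_b]))
```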

    Using UML’s Sequence Diagrams for Representing Execution Models Associated to Triggers

    11 pages, 3 figures. Contributed to: 23rd British National Conference on Databases (BNCOD 23), Belfast, Northern Ireland, UK, July 18-20, 2006.

Using active rules or triggers to verify integrity constraints is a serious and complex problem because these mechanisms exhibit behaviour that can be difficult to predict in a complex database. The situation is even worse because few tools are available for developing and verifying them. We believe that automatic support for trigger development and verification would help database developers to adopt triggers in the database design process. Therefore, in this work we propose a visualization add-in tool that represents and verifies trigger execution using UML’s sequence diagrams. The tool is added into Rational Rose and simulates the execution sequence of a set of triggers when a DML operation occurs. It uses the SQL standard to express trigger semantics and execution.

This work is part of the project "Software Process Management Platform: modelling, reuse and measurement" (TIN2004/07083).